CS 150 - Lab 8

The purpose of this lab is to:

Find the k smallest elements in a list.
Work with Sets and Dictionaries.
Practice reading from files and string manipulation.
Do fun stuff with words.

Getting Started

Dictionaries

We have talked about dictionaries in class. Here is a brief reminder of the main properties of dictionaries.

  Ages = {}                        # sets Ages to be an empty dictionary
  Ages["Hermione"]                 # returns the value associated with "Hermione", presumably
                                   # her age.  Throws an error if "Hermione" is not a key.
  Ages["Hermione"] = 18            # makes "Hermione" a key and associates 18 with it.
  del Ages["Hermione"]             # removes key "Hermione" and the value associated with it.
  Ages.keys()                      # returns a "view" of the keys of Ages. You can treat this
                                   # like a list of the keys.
  len(Ages)                        # returns the number of keys in Ages
  for person in Ages :             # iterates over the keys in Ages.
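
As a small illustration of these operations used together, here is a minimal sketch. The entry for "Ron" and the lookup of "Harry" are made up for the example; the point is that testing with `in` avoids the error raised by looking up a missing key.

```python
Ages = {}                          # start with an empty dictionary
Ages["Hermione"] = 18              # add a key with its value
Ages["Ron"] = 17                   # hypothetical extra entry
Ages["Ron"] = Ages["Ron"] + 1      # update the value of an existing key

# Looking up a missing key raises KeyError, so test with `in` first.
if "Harry" in Ages:
    print(Ages["Harry"])
else:
    print("Harry is not a key")

del Ages["Hermione"]               # remove a key and its value

for person in Ages:                # iterate over the remaining keys
    print(person, Ages[person])

names = list(Ages.keys())          # turn the key view into a real list
```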

Distilling Text

Consider the following text:

   Question:
   Whether nobler mind suffer
   Slings arrows outrageous fortune
   Take arms against sea troubles
   By opposing them.

You probably recognized this as a condensed version of the first few lines of Hamlet's famous "To be, or not to be" soliloquy. The original unedited text is:

   To be, or not to be -- that is the question:
   Whether 'tis nobler in the mind to suffer
   The slings and arrows of outrageous fortune
   Or to take arms against a sea of troubles
   And by opposing end them.

The former was produced by finding the 30 most commonly used words in the speech and removing them. Your first challenge is to write a program called distill.py that prompts the user for the name of a text file and a number n, and prints the contents of that text file with the n most common words removed.

Program Outline

The details of implementing a solution are up to you, but here is a suggested outline of how to approach the problem. As usual, think about the 6 steps of program development, and test each piece as you go.

Start by writing the interaction with the user. You need to get from the user a file name and a number n, which will be the size of the frequent word list. Read in the file line by line and print it, just to make sure you can access it. Here are some files for you to try: hamlet.txt, lincoln.txt, prufrock.txt, FoxInSocks.txt, jabberwocky.txt. "hamlet.txt" is the 'to be or not to be' soliloquy, "lincoln.txt" is the Gettysburg Address, "prufrock.txt" is Eliot's poem 'The Love Song of J. Alfred Prufrock', "FoxInSocks.txt" is the Dr. Seuss book (note that it has a very restricted set of words, so don't remove too many of them), and "jabberwocky.txt" is the Lewis Carroll poem.
Create a dictionary Count to keep track of the word counts for the given text file. The keys of this dictionary will be words; the value associated with each word will be the number of times it appears. Just as you did with the Concordance program, make a pass through the file: split each line into words and add each word to the dictionary. The first time you see a word (it will not yet be one of the keys), add it to the dictionary with value 1. Each subsequent time (it will already be a key), increment its value by 1.
If you print the keys of the dictionary you will see that some have punctuation attached to them, some are capitalized and some aren't, and so forth. We'll remove that just as we did with the concordance in Lab 6. Write a function cleanstring(s) that takes the lower-case version of string s, removes the punctuation marks, and returns the result. The easy way to strip off punctuation from string s is to make a string punct that has all of the punctuation marks you want to remove. For example, you might use punct = ".,;!". Then s.strip(punct) returns s with all of the characters in punct removed from its start and end. Instead of adding word w to the dictionary, look at cleanstring(w). If this is the empty string (you can have strings that are all punctuation marks; after you remove the punctuation there is nothing left), ignore it. If cleanstring(w) is not empty, add it to the dictionary.
Once you've built the Count dictionary, you need to find a way to get a list of the n most common words. Here is one way to do this; you may want to write a function to handle it. Start by dumping the dictionary into a list of (word, count) pairs: [('to', 15), ('be', 10), ('or', 8), ...]. Then process this list as follows: start by finding the entry of the list that has the largest count, and switching it with the index 0 entry of the list. Then start at index 1, find the largest remaining entry, and switch it with the index 1 entry. Repeat this for the first n entries. When you are done, the first portion of the list for hamlet.txt should look like [('the', 20), ('to', 15), ('of', 15), ('and', 12), ('that', 7), ...]. Make a final pass through this list, pulling off the first n words into their own list: ['the', 'to', 'of', 'and', 'that', ...]. This is our list commonwords.
You need to go through the file again. Either reopen it (F = open ...) or go back to the top with F.seek(0), where F is your file variable. Process the file a line at a time, printing each word that doesn't appear in commonwords. When you are deciding whether to print a word or not, you'll want to look at the "cleaned" version of the word, but you should print the original version, including the punctuation attached to it. You can decide what you want to do with words whose cleanstring( ) version is empty (words consisting only of punctuation characters) -- you can print them or leave them out, as you wish. Several of the sample text files we give you are poetry, with the first letter of each line capitalized. You can do that or not with your output, as you wish. If you want to do it you can use the string method capitalize on the first word of the line: if s is a string, s.capitalize() is s with the first letter converted to upper case and the remaining letters converted to lower case.
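
The cleaning and counting steps above might be sketched as follows. This is one possible shape rather than the required solution: the particular punctuation string, the helper name count_words, and the use of a list of strings in place of an open file are all assumptions made to keep the example self-contained.

```python
def cleanstring(s):
    """Return s in lower case with punctuation stripped from both ends."""
    punct = ".,;:!?'\"-()"          # assumed set of marks; adjust as needed
    return s.lower().strip(punct)


def count_words(lines):
    """Build a dictionary mapping each cleaned word to its count.

    lines is any iterable of strings. An open file works too, since
    iterating over a file yields its lines. (count_words is a
    hypothetical helper name, not part of the assignment.)
    """
    Count = {}
    for line in lines:
        for w in line.split():
            word = cleanstring(w)
            if word == "":          # w was all punctuation; ignore it
                continue
            if word in Count:
                Count[word] = Count[word] + 1
            else:
                Count[word] = 1
    return Count


sample = ["To be, or not to be -- that is the question:",
          "Whether 'tis nobler in the mind to suffer"]
Count = count_words(sample)
print(Count["to"])                  # 3
print(Count["be"])                  # 2
```

Note that strip only touches the two ends of a word, so an apostrophe inside a word such as "don't" is left alone.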

Note that your output may differ slightly from what we have shown due to ties in word counts.
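
The partial selection procedure from the outline could look like the sketch below. The function name most_common_words and the sample counts are hypothetical. Because the comparison is a strict "greater than", the first entry found among equal counts is kept, so which word wins a tie depends on the order of the entries in the dictionary.

```python
def most_common_words(Count, n):
    """Return a list of the n words with the highest counts in Count."""
    pairs = list(Count.items())            # list of (word, count) tuples
    n = min(n, len(pairs))                 # guard against n larger than the list
    for i in range(n):
        best = i                           # index of the largest count seen so far
        for j in range(i + 1, len(pairs)):
            if pairs[j][1] > pairs[best][1]:
                best = j
        pairs[i], pairs[best] = pairs[best], pairs[i]   # swap into position i
    return [pairs[i][0] for i in range(n)]


Count = {"be": 10, "the": 20, "or": 8, "to": 15, "of": 15}   # made-up counts
commonwords = most_common_words(Count, 3)
print(commonwords)                         # ['the', 'to', 'of']
```

Only the first n positions are put in order, so this does less work than sorting the whole list when n is small.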

Sets

A set is another built-in data structure supported by Python for the mathematical notion of a set, i.e. a collection of elements. Unlike a dictionary, the elements in a set don't have values associated with them. You could simulate a set using a dictionary, by adding a key for each element, and setting that key's value to something arbitrary, like 0, or an empty string, or None. That said, if you don't have data associated with each element, and simply want to keep track of a set of items, using a set is the way to go.

Unlike lists, sets are not ordered, but, as with dictionaries, testing membership and adding or removing elements is very fast. Sets do not store duplicate elements: adding an element to a set that already contains that element has no effect.
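
These points can be checked in a short sketch. It uses the set operations listed in the syntax summary in this section; the list of names fed to the dictionary is invented for the example.

```python
team = {"kirk", "spock"}
team.add("bones")
team.add("spock")                  # already present, so nothing changes
print(len(team))                   # 3

# remove raises KeyError if the element is missing, so check first
if "kirk" in team:
    team.remove("kirk")

# Simulating the same set with a dictionary whose values are ignored
team_dict = {}
for name in ["spock", "bones", "spock"]:
    team_dict[name] = None         # the value is arbitrary
print(set(team_dict.keys()) == team)   # True
```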

Here are some examples of syntax involving sets.

  team = set()                       # makes a set with 0 elements
  team = {"kirk", "spock"}           # makes a set with 2 elements
  len(team)                          # 2 
  team.add("bones")                  # adds "bones" to team
  team.remove("kirk")                # removes "kirk" from team
  for p in team :                    # iterates through elements of team
  "bones" in team                    # True
  "malcolm" in team                  # False
  "river" not in team                # True

Anagrams

An anagram is just a rearrangement of the letters in the word to form another word or words. For example, here are some anagrams for the phrase "oberlin student":

let none disturb
run no bed titles
let us not rebind
trust line on bed
but not red lines
bound in letters
let in; runs to bed

For this part of the lab, you will write a program called anagrams.py that reads in a file so it knows what strings are words in English, and then reads phrases from the user and prints anagrams for them. Your program should prompt the user for the dictionary file, then go into a loop reading strings and printing anagrams for them.

Program Outline

Your program should take the following steps:

Read in a text document containing a word list. Here are two: words1.txt, words2.txt. The first is very small, just for testing purposes. The second contains about 4000 common words. When you start writing the program you might as well handle the entire user interaction. After you read in the word list, go into a loop asking the user for a string s and finding its anagrams. When the user gives a blank line, quit the program. The algorithm we will use for finding anagrams won't handle spaces, so you need to remove the spaces from the input string. An easy way to do that is s = s.replace(" ", ""), i.e., replace all space characters with the empty string.
Build a set words containing each word from the word list file. Since we have a lot of words, using a set rather than a list will save us a lot of time when testing membership (which is basically all we'll be using it for).
Create a function called contains(s, word) which returns a pair of values. The first value should be a boolean indicating whether the string s contains the letters necessary to spell word. If the answer is True, the second value should be what remains of s after the letters in word have been removed. If the first answer is False, the second value returned should just be an empty string. For example,
```
    contains("zombiepig", "bozo")       # returns False, ""
    contains("zombiepig", "biz")        # returns True, "omepig"
```
Create a recursive function called grams(s, words, sofar) that takes in a string s, a set of words words, and a list of words sofar. This function tries to find all anagrams of s using elements found in words. Each time it does find an anagram for s it prints all the words in sofar.

You might be wondering why we're passing around the variable sofar. Indeed, when we want to find the anagrams of a string given by the user, we'll pass in an empty list. However, that list will be critical for making use of recursion. Let's look at an example to see why. Suppose we want to find anagrams of the word
```
    robopirate
```
To do this we look through our wordlist for words that are contained in this string. The string "cat" doesn't appear in "robopirate", but "air" does. So one thing our function call will do is begin looking through the remainder of "robopirate" with "air" removed, looking for further anagrams. That is, it'll continue to look for strings contained in
```
    robopte 
```
Our list includes "bro", which is contained in "robopte", so another recursive call will be made on the remains, namely "opte". Our wordlist contains "poet", leaving us with an empty string. At this point we've used up all the letters in the string, so we have an anagram, namely
```
 air bro poet 
```
Unfortunately, if we want to print our anagram, we're in trouble, since we haven't kept a record of the previous words we found. (Why couldn't we have just printed words as we found them?) That's where sofar comes in. This list will track the words we've found so far in this particular branch of the recursion. That is,
```
 grams("robopirate", words, []) 
```
will call (among other things)
```
grams("robopte", words, ["air"]) 
```
which in turn calls
```
grams("opte", words, ["air", "bro"]) 
```
which in turn calls
```
grams("", words, ["air", "bro", "poet"])
```
which can now print the complete anagram.

With this in mind, we're ready to describe the overall structure of this function. We loop over every word w in our wordlist. For each word w that's found in our string s, we make a recursive call on the remainder of s, and with a new list, equal to the current list with w added on. We need to make a new list, because there is no easy way to remove w from the list when we move on to another word. If we recurse to the point where s is the empty string, we can just print the contents of sofar.
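
The structure just described might be sketched as follows. Treat it as one possible reading of the outline, not the reference solution: the way contains uses up letters with find and slicing is an assumption, and the four-word set is made up for the demonstration. The sketch also assumes the word set never contains the empty string, which would make the recursion loop forever.

```python
def contains(s, word):
    """Return (True, remainder) if s has the letters to spell word,
    otherwise (False, "")."""
    remaining = s
    for letter in word:
        i = remaining.find(letter)
        if i == -1:                        # this letter is not available
            return False, ""
        remaining = remaining[:i] + remaining[i + 1:]   # use up one copy
    return True, remaining


def grams(s, words, sofar):
    """Print every way to spell all of s using words from the set words."""
    if s == "":
        print(" ".join(sofar))             # all letters used: an anagram
        return
    for w in words:
        found, rest = contains(s, w)
        if found:
            grams(rest, words, sofar + [w])    # builds a new list; sofar is unchanged


words = {"air", "bro", "poet", "cat"}      # tiny made-up word list
grams("robopirate", words, [])
```

Writing sofar + [w] creates a fresh list for each recursive call, so the caller's list never has to be undone when the loop moves on to the next word.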

Test output

If you run your program using words1.txt for your word list on the string "robopirate", you should get

   or ape orbit
   or orbit ape
   bro air poet
   bro poet air
   air bro poet
   air poet bro
   ape or orbit
   ape orbit or
   poet bro air
   poet air bro
   orbit or ape
   orbit ape or

It doesn't matter if your output is in another order. Notice that for any set of words that form an anagram, the program prints this set once for each possible ordering of the words. There are k! orderings for a set of k words. In this robopirate example there are two sets of words making the anagrams: {'or', 'ape', 'orbit'} and {'poet', 'air', 'bro'}. The twelve lines of output consist of each of these sets printed 6 times (since 6 = 3!). When printing anagrams of "oberlin student" with the words2.txt dictionary, I had over 35,000 lines of output.
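
The count of orderings can be checked directly. This small sketch uses itertools.permutations and math.factorial from the standard library on one of the two word sets from the example; it is only a sanity check, not part of the program you are asked to write.

```python
from itertools import permutations
from math import factorial

group = ["poet", "air", "bro"]
orderings = [" ".join(p) for p in permutations(group)]
for line in orderings:
    print(line)

print(len(orderings) == factorial(len(group)))   # True: 3! = 6
```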

Here are some tests for the larger words2.txt file:

"frodo baggins" has anagrams that include "go for bad sign"

"hermione granger" has anagrams that include "ignore green harm"

"ron weasley" has anagrams that include "as we rely on"

"oberlin conservatory" has many anagrams, including "so convert one library", "naive err controls boy", "only recover into bars", "obtain no clever sorry", "lost, recover no binary", "boy never controls air", and "be sorry; naive control". Don't tell Dean Kalyn about that last one.

Improvements

Once that's working, you may optionally add in the following extensions. Note that these are not part of the assignment, just fun and interesting extensions.

Things slow down a lot when we have longer word lists. One nice optimization makes use of the following observation: if the user's string (s) doesn't contain a particular word (w), then no remainder of s will contain w either. So instead of iterating through the set of all words at each step of the recursion, we need only iterate through the "plausible" words.
Add a "preprocessing" step: Instead of adding every string found in the word file to the set words, only add those that are contained in the user's input string. Fortunately you already have created a function that can help you out here.
Let the user specify a minimum length of words allowable in their anagrams. Of course, you will want to eliminate the shorter words from your words set.
Let the user specify a maximum number of words allowable in their anagrams.
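
The preprocessing and minimum-length ideas might be combined as in the sketch below. The names fits and plausible_words and the min_length parameter are hypothetical. In your own program the contains function would do the job of fits; a collections.Counter comparison stands in for it here only to keep the example short and self-contained.

```python
from collections import Counter


def fits(s, word):
    """Return True if s has enough of each letter to spell word."""
    # Counter subtraction keeps only positive counts, so the result is
    # empty exactly when every letter of word is available in s.
    return not (Counter(word) - Counter(s))


def plausible_words(all_words, s, min_length=1):
    """Return the set of words that are long enough and can be spelled from s."""
    keep = set()
    for w in all_words:
        if len(w) >= min_length and fits(s, w):
            keep.add(w)
    return keep


all_words = ["air", "bro", "poet", "cat", "or", "zebra"]   # made-up list
words = plausible_words(all_words, "robopirate", min_length=3)
print(sorted(words))                       # ['air', 'bro', 'poet']
```

With min_length of at least 1, the length test also keeps an empty string out of the word set.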

Handin

If you followed the Honor Code in this assignment, create a file named README.txt containing the following text:

I affirm that I have adhered to the Honor Code in this assignment.

You now just need to electronically hand in all your files. As a reminder:

 
     % cd             # changes to your home directory
     % cd cs150       # goes to your cs150 folder
     % handin         # starts the handin program
                      # class is 150
                      # assignment is 8
                      # file/directory is lab08
     % lshand         # should show that you've handed in something

You can also specify the options to handin from the command line:

 
     % cd ~/cs150     # goes to your cs150 folder
     % handin -c 150 -a 8 lab08

File Checklist

You should have submitted the following files:

   distill.py
   anagrams.py

   README.txt  (with the Honor Pledge)